Data Analysis Project

Author

Vilmantas Gėgžna

Modified

2023-02-04

Data analysis tools: Python, R
Helper tools: Quarto, RStudio, Git
Skills:

Abbreviations

  • CI – 95% confidence interval.
  • CV – Cross-validation.
  • Dim – dimension (principal component, the same as PC).
  • n – either sample or group size.
  • p – p-value.
  • p_adj – p-value (adjusted).
  • PCA – principal component analysis.
  • PC – principal component.
  • r – correlation coefficient.
  • R² – r squared, coefficient of determination.
  • RF – random forest.
  • RNG – (pseudo)random number generator.

1 Introduction

Wine quality assessment and certification are important for both wine production and sales. Certification helps prevent counterfeiting and in this way protects consumers' health as well as assures quality on the wine market. Identifying the most influential factors is crucial for effective quality evaluation, while classifying wines into quality groups is useful for setting prices (Cortez et al. 2009).

Vinho verde is wine from the Minho region, which is located in the northwest part of Portugal. In this project, data of red vinho verde wine specimens are investigated. The dataset was originally provided by (Cortez et al. 2009). This project mainly focuses on:

  1. prediction (to investigate how accurately (1) wine quality and (2) alcohol content in wine can be predicted) and
  2. explanation (to identify the most important factors for these prediction tasks).

1.1 The Dataset

The dataset is described in several sources. Pieces of the description from the article by Cortez et al. (2009) as well as from the websites “Red Wine Quality” (Kaggle.com) and “Wine Quality Datasets” (www3.dsi.uminho.pt) were copied, merged, and presented in this sub-section.

Variables based on physicochemical tests:

  1. fixed acidity (\(g_{\textrm{ tartaric acid}}/l\)): most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
  2. volatile acidity (\(g_{\textrm{ acetic acid}}/l\)): the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.
  3. citric acid (\(g/l\)): found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
  4. residual sugar (\(g/l\)): the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
  5. chlorides (\(g_{\textrm{ sodium chloride}}/l\)): the amount of salt in the wine.
  6. free sulfur dioxide (\(mg/l\)): the free form of SO₂ exists in equilibrium between molecular SO₂ (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
  7. total sulfur dioxide (\(mg/l\)): amount of free and bound forms of SO₂; in low concentrations, SO₂ is mostly undetectable in wine, but at free SO₂ concentrations over 50 ppm, SO₂ becomes evident in the nose and taste of wine.
  8. density (\(g/cm^3\)): the density of wine is close to that of water, depending on the percent alcohol and sugar content.
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
  10. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO₂) levels, which acts as an antimicrobial and antioxidant.
  11. alcohol (% vol.): the percent alcohol content of the wine.

Variable based on sensory data:

  1. quality (score between 0 and 10).

2 Methods

2.1 Additional Technical Tasks

Use Python and R simultaneously

RStudio is an IDE for R and Python. It provides an interface between R and Python for interactive analysis. So the technical tasks were to:

  • Try RStudio capabilities for Python programming (documentation, data analysis, plotting, etc.).
  • Try RStudio interface between R and Python: use these 2 tools in the same file.

Use Quarto with knitr Engine

Quarto is a full-featured system for technical and scientific publishing based on Pandoc. It works with the Jupyter (“Python”, “IR”, or other kernels) and knitr (multilingual) engines, and can be integrated with Jupyter notebooks (.ipynb) or used with plain-text Quarto (.qmd) files.

  • Try Quarto capabilities for reproducibility, bi-lingual (R and Python) analysis and technical publishing.

Main principles:

  • Python is used for all steps that are directly related to putting the model into production.
  • Some exploratory analysis steps and other tasks that are easier to perform in R are done in R.

A thoughtful decision

Before deciding to do a bilingual project, I consulted my batch STL, who encouraged doing it and, later, suggested sharing my insights during the stand-up.

The reasons, pros, and cons related to these technical tasks will be discussed in more detail during the oral presentation.

2.2 Population and Sample

The population of interest is red vinho verde wines. It is assumed that the data is a simple random sample or its equivalent that allows using methods of inferential statistics.

2.3 Training and Test Sets

As the analysis includes a prediction step, the whole sample is divided into model training and test sets using an 80:20 train/test hold-out strategy with stratification by wine quality scores.

The training set was used for:

  • exploratory analysis;
  • statistical inference;
  • model creation and selection.

The test set was used only to evaluate the performance of the final models.

2.4 General: Inferential Statistics

For statistical inference, significance level \(\alpha=0.05\) and 95% confidence level are used.

2.5 Differences Between Groups

For statistical inference about several groups (categories), the following strategy is used:

  • first, 95% confidence intervals (CI) for each group are calculated;
  • second, an omnibus test is performed.

In the analysis of counts (sizes of wine quality groups), the following methods are used:

  • confidence intervals: Goodman’s simultaneous confidence intervals for proportions;
  • omnibus test: Pearson’s chi-square (χ²) goodness-of-fit (GOF) test with hypotheses:
    \(H_0:\) all proportions are equal: \(~ \pi_1 = \pi_2 = ... = \pi_i = ... = \pi_k\);
    \(H_1:\) at least two proportions differ: \(\pi_i \ne \pi_j\) for at least one pair of \(i\) and \(j\).

Here \(\pi\) (pi) is the proportion (relative frequency, percentage) of values in a certain group, \(i\) and \(j\) are group indices (\(i\) = 1, 2, …, \(k\); \(j\) = 1, 2, …, \(k\); \(i \ne j\)), and \(k\) is the total number of groups.
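As a minimal sketch, the omnibus GOF test can be computed with `scipy.stats.chisquare`; the counts below are the wine quality group sizes (low, medium, high) reported later in Table 1:

```python
# Pearson's chi-square goodness-of-fit test for equal proportions.
# Counts: quality group sizes (low, medium, high) from Table 1.
import numpy as np
import scipy.stats as sps

counts = np.array([63, 1319, 217])
k = len(counts)
expected = np.full(k, counts.sum() / k)  # H0: pi_1 = pi_2 = ... = pi_k

chi2, p = sps.chisquare(f_obs=counts, f_exp=expected)
print(f"chi-square = {chi2:.1f}, p = {p:.3g}")
```

With such unequal group sizes, the null hypothesis of equal proportions is clearly rejected.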

2.6 Correlation Between Variables

For relationships between numeric variables, parametric Pearson’s correlation analysis is used.

  • Hypotheses:
    \(H_0:\) \(\rho = 0\) (variables do not correlate),
    \(H_1:\) \(\rho \ne 0\) (variables correlate).

    Here \(\rho\) (rho) is population Pearson’s correlation coefficient.

In statistical significance testing, since the hypotheses are tested multiple times, the Holm correction is applied to prevent an inflated type I error rate.
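The Holm correction itself can be sketched with `statsmodels` (the p values below are illustrative, not results from the wine data):

```python
# Holm correction for multiple hypothesis tests.
# The p values are made up for illustration only.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.039, 0.041, 0.520]
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="holm")

for p, pa, r in zip(p_values, p_adj, reject):
    print(f"p = {p:.3f}  ->  p_adj = {pa:.3f}  reject H0: {r}")
```

Note that after adjustment only the two smallest p values remain below α = 0.05.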

2.7 Regression Task

To predict alcohol content, linear regression is used. For feature selection, several strategies were applied:

  • statistical inference and assumption checking based strategy,
  • sequential feature selection (sfs) with 5-fold cross validation.

The main metric for feature selection was R².

2.8 Classification Task

To predict wine quality, logistic regression is used. For feature selection, a statistical-inference-based strategy was used.
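As a minimal illustration of such a classifier (synthetic data, not the wine training set; the class labels only mirror the quality groups used in this project), a multinomial logistic regression can be fitted with scikit-learn:

```python
# Minimal sketch of a multinomial logistic-regression classifier.
# The data are synthetic: 4 made-up features and a latent "quality"
# score thresholded into three groups.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(100)
X = rng.normal(size=(300, 4))
score = X[:, 0] + 0.5 * X[:, 1]            # latent linear score
y = np.where(score < -0.5, "low",
             np.where(score < 0.5, "medium", "high"))

model = LogisticRegression(max_iter=1000).fit(X, y)
accuracy = model.score(X, y)               # training accuracy
print(model.classes_, f"accuracy = {accuracy:.2f}")
```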

2.9 Feature Engineering

In some cases, the data were right-skewed. Such variables were investigated both on the original (untransformed) scale and after a logarithmic transformation.
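The effect of such a transformation can be sketched with synthetic right-skewed data (a log-normal sample, not a variable from the wine dataset):

```python
# A right-skewed variable before and after a log transformation.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(100)
x = rng.lognormal(mean=0, sigma=1, size=1000)  # right-skewed by construction

skew_raw = skew(x)
skew_log = skew(np.log(x))
print(f"skewness: original = {skew_raw:.2f}, log scale = {skew_log:.2f}")
```

After the log transformation, the skewness drops to near zero, which is friendlier for linear models.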

3 Preparation and Inspection

3.1 Setup

Both R and Python will be used in this project. R code is marked with blue and Python code with yellow stripes.

Code: R setup
# R setup ================================================
# R packages ---------------------------------------------
library(tidyverse)
library(reticulate)
library(factoextra)
library(DescTools)
library(patchwork)

# R options ----------------------------------------------
options(scipen = 5)
DescToolsOptions(stamp = NULL, digits = 3)

# Default ggplot2 theme
theme_set(theme_bw())

# Other options
knitr::opts_chunk$set(fig.height = 3.5, fig.width = 6, fig.align = "center")

# RNG state
set.seed(100)
Code: Python setup
# Python setup =========================================

# Packages and modules ---------------------------------

# Data wrangling, math
import numpy as np
import pandas as pd
import janitor
from numpy import log

# Statistical analysis and modeling
import patsy
import scipy.stats as sps
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.formula.api as smf
import pingouin as pg
import scikit_posthocs as sp
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error as mse
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier 
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


# Python Settings -----------------------------------------
# Default plot options
plt.rc("figure", titleweight="bold")
plt.rc("axes", labelweight="bold", titleweight="bold")
plt.rc("font", weight="normal", size=10)
plt.rc("figure", figsize=(7, 3))

# Pandas options
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 20)
pd.set_option("display.max_colwidth", 40)  # Possible option: None
pd.set_option("display.float_format", lambda x: f"{x:.3f}")
pd.set_option("styler.format.thousands", ",")

# RNG state
np.random.seed(100)
Code: Ad-hoc Python functions
# Ad-hoc Python functions
def group_quality_scores(score):
  """
  Wine quality score will be recoded as quality group as follows:
  
  - score ≤ 4 = low,
  - score [5-6] = medium,
  - score ≥ 7 = high.
  
  args:
    score (int, float): numeric value from 0 to 10
  """
  assert 0 <= score <= 10
  score = int(score)
  
  if score <= 4:
      return "low"
  elif 5 <= score <= 6:
      return "medium"
  elif 7 <= score:
      return "high"


# Compute the vif for all given features
def compute_vif(X, sort=True):
    """Compute variance inflation factor, VIF
    
    Args:
      X (pandas.DataFrame): features without the target variable.
      sort (bool): True - sort VIF values
      
    Return:
      vif (pandas.DataFrame): data frame with columns "feature" and "vif"
    """
    # Requires
    # statsmodels.stats.outliers_influence.variance_inflation_factor
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    
    # The calculation of variance inflation REQUIRES a constant term,
    # as the calculations are based on `statsmodels`
    X = X.copy()  # work on a copy to avoid modifying the caller's data frame
    X['(intercept)'] = 1
    
    # Create data frame to store VIF values
    vif = pd.DataFrame()
    vif["feature"] = X.columns
    vif["vif"] = [
      variance_inflation_factor(X.values, i) for i in range(X.shape[1])
    ]
    # Remove intercept (constant) term
    vif = vif[vif['feature'] != '(intercept)']
    
    # Sort
    if sort:
      vif = vif.sort_values("vif", ascending = False)
    
    # Output
    return vif


def format_p0(p):
    """Format p values at 3 decimal places.

    Args:
        p (float): p value (number between 0 and 1).
    """
    if p < 0.001:
        return "<0.001"
    elif p > 0.999:
        return ">0.999"
    else:
        return f"{p:.3f}"


def get_rf_importance(formula, data, y_1d=True):
  """Get feature importance from random forests regression model.
  
  Args:
    formula (str): R-like model formula
    data (pandas.DataFrame): dataset
    y_1d (bool): Should y model matrix be 1D array?
    
  Return: 
    pandas.DataFrame with columns "terms" and "importance".
  """
  y, X = patsy.dmatrices(formula, data)
  
  if y_1d:
    y = y.ravel()
    
  model = RandomForestRegressor().fit(X, y)
  return (pd.DataFrame({
    "terms": X.design_info.column_names,
    "importance": model.feature_importances_
    })
    .sort_values("importance", ascending=False)
    )


def pairwise_correlation(data, method="pearson"):
  """Correlation analysis (including p value calculation)
  
  Args:
    data (pandas.DataFrame): dataset with numeric variables
    method (str): correlation method, e.g., "pearson" or "spearman".

  Return: 
    pandas.DataFrame with the results of correlation analysis.
  
  """
  return (pg.pairwise_corr(data, method=method, padjust="holm")
    .sort_values(by='r', key=abs, ascending=False)
    .transform_column("p-corr", format_p0, "p_adj")
    .select_columns(["X", "Y", "r", "CI95%", "p_adj"])
  )


# Functions for regression/classification
def do_sfs_linear_regression(formula, data, forward=True):
    """Do sequential feature selection for linear regression
    
    Args:
      formula (str): R style formula
      data (pandas.DataFrame)
      forward (bool): True – forward selection
                      False – backward selection
    """
    lin_regression = LinearRegression(fit_intercept=True)
    
    sfs = SequentialFeatureSelector(
      lin_regression, k_features="parsimonious", forward=forward, floating=True,
      scoring="r2", verbose=0, cv=5
    )
    
    y, X = patsy.dmatrices(
      formula + "+ 0", data, return_type='dataframe'
    )
    
    return sfs.fit(X, y)


def get_sfs_performance(sfs_object, n_features):
  """Return performance measures for a certain number of features.
  
  Args:
      sfs_object: result of function do_sfs_linear_regression()
      n_features (int): number of features.
  """
  md = round(
    np.median(sfs_object.get_metric_dict()[n_features]['cv_scores']),
    3
  )
  return ({
    "n_features": n_features,
    "mean R²": round(sfs_object.get_metric_dict()[n_features]['avg_score'], 3),
    "median R²": md,
    "sd R²": round(sfs_object.get_metric_dict()[n_features]['std_dev'], 3)
  })


def show_sfs_results_lin_reg(sfs_object, sub_title=""):
    """Show results of do_sfs_linear_regression()
    
      Args:
      sfs_object: result of function do_sfs_linear_regression()
      sub_title (str): second line of title.
    """
    if sfs_object.forward:
      sfs_type = "Forward"
    else:
      sfs_type = "Backward"
      
    
    plt.clf()
    fig, ax = plt.subplots(1, 2)

    avg_score = [(i, c["avg_score"]) for i, c in sfs_object.subsets_.items()]
    (
      pd.DataFrame(avg_score, columns=["n_features", "avg_score"])
      .plot.scatter(
        x="n_features", 
        y="avg_score", 
        xlabel="Number of features included",
        ylabel="Average R²",
        ylim=(0.1, 0.8),
        ax=ax[0]
        )
    );
    
    cv_scores = {i:c["cv_scores"] for i, c in sfs_object.subsets_.items()}
    (
      pd.DataFrame(cv_scores)
      # .drop([1, 2, 3], axis=1)
      .plot.box(
        xlabel="Number of features included",
        ylabel="R²",
        ylim=(0.1, 0.8),
        ax=ax[1]
        )
    );
    
    if not sfs_object.forward:
      ax[1].invert_xaxis();
    
    main_title = (
        f"{sfs_type} Feature Selection with {sfs_object.cv}-fold CV " +
        f"\n{sub_title}"
    )

    fig.suptitle(main_title);
    plt.tight_layout();
    plt.show()
    
    # Print results
    n_selected = f"n = {len(sfs_object.k_feature_names_)}"
    r_sq_selected = f"avg. R² = {sfs_object.k_score_:.3f}"
    
    print(main_title)
    print("\nSelected variables:")
    print(f"[ {n_selected}, {r_sq_selected} ]\n")
    for i, name in enumerate(sfs_object.k_feature_names_, start=1):
      print(f"{i}. {name}")

3.2 Import Data

Data was provided as a text file.

Preview of data file

A few lines from the text file with data:

R code
read_lines("data/winequality-red.csv", n_max = 6) |> writeLines()
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
Python code
wine_raw = pd.read_csv("data/winequality-red.csv", delimiter=";")

The dataset contains 1599 rows and 12 columns. All variables are numeric. No missing values were found.

Details: inspect wine_raw dataset

The properties of the imported dataset:

Python code
wine_raw.shape
(1599, 12)
R code
head(py$wine_raw)
R code
glimpse(py$wine_raw)
Rows: 1,599
Columns: 12
$ `fixed acidity`        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7…
$ `volatile acidity`     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600…
$ `citric acid`          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00,…
$ `residual sugar`       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.…
$ chlorides              <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069…
$ `free sulfur dioxide`  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, …
$ `total sulfur dioxide` <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 10…
$ density                <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978,…
$ pH                     <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39,…
$ sulphates              <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47,…
$ alcohol                <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 1…
$ quality                <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5,…

Number of missing values:

Python code
wine_raw.isna().sum()
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
Python code
wine_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

3.3 Pre-Processing and Inspection

3.3.1 Recoding and Renaming Variables

Variable names will be converted to lower snake case (e.g., to make them easier to use in statistical model formulas). Wine quality scores will be merged into quality groups as follows:

  • ≤ 4 – low,
  • [5-6] – medium,
  • ≥ 7 – high.
Python code
wine = wine_raw.clean_names()

wine['quality_gr'] = pd.Categorical(
  wine['quality'].apply(group_quality_scores),
  categories=["low", "medium", "high"],
  ordered=True
)

del wine_raw

How do quality scores and quality groups match?

R code
py$wine |> with(table(quality_gr, quality)) |> addmargins() |> pander::pander()
Table 1: Match between wine quality scores (columns) and quality groups (rows). Values indicate counts. “Sum” shows column/row sums.
            3    4    5    6    7    8   Sum
  low      10   53    0    0    0    0    63
  medium    0    0  681  638    0    0  1319
  high      0    0    0    0  199   18   217
  Sum      10   53  681  638  199   18  1599

Variable recoding was performed without errors, as the numbers add up correctly.

Details: inspect wine dataset

Column names:

R code
colnames(py$wine)
 [1] "fixed_acidity"        "volatile_acidity"     "citric_acid"         
 [4] "residual_sugar"       "chlorides"            "free_sulfur_dioxide" 
 [7] "total_sulfur_dioxide" "density"              "ph"                  
[10] "sulphates"            "alcohol"              "quality"             
[13] "quality_gr"          

Dataset:

R code
head(py$wine)
R code
glimpse(py$wine)
Rows: 1,599
Columns: 13
$ fixed_acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
$ volatile_acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
$ citric_acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
$ residual_sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
$ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
$ free_sulfur_dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
$ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
$ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
$ ph                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
$ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
$ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
$ quality              <dbl> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
$ quality_gr           <ord> medium, medium, medium, medium, medium, medium, m…

3.3.2 Split into Training/Test Sets

In this project, to estimate the models’ predictive power more accurately, an 80:20 train/test hold-out split with stratification by wine quality scores will be used.

Python code
wine_train, wine_test = train_test_split(
  wine,
  test_size=0.2,
  random_state=50,
  stratify=wine['quality']
)

del wine

The sizes of training and test sets after splitting are 1279:320 wine samples. If needed, find more details below.

Details: inspect data after splitting to train/test sets

The dimensions of training set:

Python code
wine_train.shape
(1279, 13)

The dimensions of test set:

Python code
wine_test.shape
(320, 13)

Is the split really stratified? Yes: the sample size ratios (column `ratio`) in all strata are around 0.80:

R code
full_join(
  py$wine_train |> count(quality, name = "n_train"),
  py$wine_test |> count(quality, name = "n_test"),
  by = "quality"
) |> 
  mutate(ratio = round(n_train / (n_train + n_test), digits = 2))

Dataset (train only):

R code
head(py$wine_train)
R code
glimpse(py$wine_train)
Rows: 1,279
Columns: 13
$ fixed_acidity        <dbl> 6.7, 8.6, 10.9, 7.7, 7.4, 10.3, 10.0, 9.0, 9.0, 8…
$ volatile_acidity     <dbl> 0.480, 0.490, 0.530, 1.005, 0.630, 0.500, 0.490, …
$ citric_acid          <dbl> 0.08, 0.29, 0.49, 0.15, 0.07, 0.42, 0.20, 0.25, 0…
$ residual_sugar       <dbl> 2.1, 2.0, 4.6, 2.1, 2.4, 2.0, 11.0, 2.0, 2.8, 2.0…
$ chlorides            <dbl> 0.064, 0.110, 0.118, 0.102, 0.090, 0.069, 0.071, …
$ free_sulfur_dioxide  <dbl> 18, 19, 10, 11, 11, 21, 13, 8, 28, 11, 9, 40, 32,…
$ total_sulfur_dioxide <dbl> 34, 133, 17, 32, 37, 51, 50, 21, 104, 27, 22, 83,…
$ density              <dbl> 0.99552, 0.99720, 1.00020, 0.99604, 0.99790, 0.99…
$ ph                   <dbl> 3.33, 2.93, 3.07, 3.23, 3.43, 3.16, 3.16, 3.27, 3…
$ sulphates            <dbl> 0.64, 1.98, 0.56, 0.48, 0.76, 0.72, 0.69, 0.72, 0…
$ alcohol              <dbl> 9.70000, 9.80000, 11.70000, 10.00000, 9.70000, 11…
$ quality              <dbl> 5, 5, 6, 5, 6, 6, 6, 5, 5, 6, 5, 6, 5, 5, 6, 4, 6…
$ quality_gr           <ord> medium, medium, medium, medium, medium, medium, m…

To avoid additional biases, the test set will not be investigated until the final prediction models are created. For now, let’s focus on the training set only.

Use training set only

From here on, further exploratory and inferential analysis, as well as statistical modelling, will use the training set only.

3.3.3 Further Inspection of Training Set

Graphical and numeric inspection of each variable (see the details below) revealed that some variables are either moderately (skewness > 0.5) or highly (skewness > 1) right-skewed. As linear models are planned for this project, skewed data may lead to non-linearities. This issue will be addressed in the next subsection.

Details: plots and summaries of each variable on original scale
R code
DescTools::Desc(
  py$wine_train |> mutate(quality = ordered(quality)),
  verbose = 3
)
------------------------------------------------------------------------------ 
Describe mutate(py$wine_train, quality = ordered(quality)) (data.frame):

data frame: 1279 obs. of  13 variables
        1279 complete cases (100.0%)

  Nr  ColName               Class            NAs  Levels                  
  1   fixed_acidity         numeric          .                            
  2   volatile_acidity      numeric          .                            
  3   citric_acid           numeric          .                            
  4   residual_sugar        numeric          .                            
  5   chlorides             numeric          .                            
  6   free_sulfur_dioxide   numeric          .                            
  7   total_sulfur_dioxide  numeric          .                            
  8   density               numeric          .                            
  9   ph                    numeric          .                            
  10  sulphates             numeric          .                            
  11  alcohol               numeric          .                            
  12  quality               ordered, factor  .    (6): 1-3, 2-4, 3-5, 4-6,
                                                  5-7, ...                
  13  quality_gr            ordered, factor  .    (3): 1-low, 2-medium,   
                                                  3-high                  


------------------------------------------------------------------------------ 
1 - fixed_acidity (numeric)

  length       n    NAs  unique    0s   mean  meanCI'
   1'279   1'279      0      94     0   8.30    8.21
          100.0%   0.0%          0.0%           8.40
                                                    
     .05     .10    .25  median   .75    .90     .95
    6.10    6.50   7.10    7.90  9.20  10.60   11.71
                                                    
   range      sd  vcoef     mad   IQR   skew    kurt
   11.30    1.75   0.21    1.48  2.10   1.02    1.26
                                                    
lowest : 4.6, 4.9, 5.0 (4), 5.1 (3), 5.2 (6)
highest: 14.3, 15.0 (2), 15.5, 15.6 (2), 15.9

' 95%-CI (classic)

Summary of variables on original scale.

------------------------------------------------------------------------------ 
2 - volatile_acidity (numeric)

  length       n    NAs  unique    0s  mean  meanCI'
   1'279   1'279      0     141     0  0.53    0.52
          100.0%   0.0%          0.0%          0.54
                                                   
     .05     .10    .25  median   .75   .90     .95
    0.27    0.31   0.40    0.52  0.64  0.76    0.85
                                                   
   range      sd  vcoef     mad   IQR  skew    kurt
    1.46    0.18   0.34    0.18  0.24  0.71    1.32
                                                   
lowest : 0.12 (3), 0.16 (2), 0.18 (8), 0.19, 0.2 (2)
highest: 1.18, 1.18, 1.24, 1.33 (2), 1.58

' 95%-CI (classic)


------------------------------------------------------------------------------ 
3 - citric_acid (numeric)

  length       n    NAs  unique    0s  mean  meanCI'
   1'279   1'279      0      78   110  0.27    0.26
          100.0%   0.0%          8.6%          0.28
                                                   
     .05     .10    .25  median   .75   .90     .95
    0.00    0.01   0.09    0.25  0.42  0.53    0.60
                                                   
   range      sd  vcoef     mad   IQR  skew    kurt
    0.79    0.19   0.73    0.25  0.33  0.32   -0.88
                                                   
lowest : 0.0 (110), 0.01 (27), 0.02 (40), 0.03 (21), 0.04 (25)
highest: 0.74 (4), 0.75, 0.76 (2), 0.78, 0.79

heap(?): remarkable frequency (8.6%) for the mode(s) (= 0)

' 95%-CI (classic)


------------------------------------------------------------------------------ 
4 - residual_sugar (numeric)

  length       n    NAs  unique    0s  mean  meanCI'
   1'279   1'279      0      85     0  2.55    2.47
          100.0%   0.0%          0.0%          2.63
                                                   
     .05     .10    .25  median   .75   .90     .95
    1.50    1.70   1.90    2.20  2.60  3.61    5.20
                                                   
   range      sd  vcoef     mad   IQR  skew    kurt
   14.50    1.44   0.57    0.44  0.70  4.42   26.74
                                                   
lowest : 0.9, 1.2 (7), 1.3 (5), 1.4 (33), 1.5 (22)
highest: 12.9, 13.4, 13.8 (2), 13.9, 15.4 (2)

heap(?): remarkable frequency (9.9%) for the mode(s) (= 2)

' 95%-CI (classic)


------------------------------------------------------------------------------ 
5 - chlorides (numeric)

  length       n     NAs  unique      0s    mean   meanCI'
   1'279   1'279       0     141       0  0.0867   0.0843
          100.0%    0.0%            0.0%           0.0891
                                                         
     .05     .10     .25  median     .75     .90      .95
  0.0540  0.0600  0.0700  0.0790  0.0900  0.1062   0.1242
                                                         
   range      sd   vcoef     mad     IQR    skew     kurt
  0.4520  0.0437  0.5045  0.0148  0.0200  5.0275  31.2563
                                                         
lowest : 0.012 (2), 0.034, 0.038 (2), 0.039 (4), 0.041 (4)
highest: 0.413, 0.414 (2), 0.415 (2), 0.422, 0.464

' 95%-CI (classic)


------------------------------------------------------------------------------ 
6 - free_sulfur_dioxide (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      58      0  15.70   15.12
          100.0%   0.0%           0.0%          16.29
                                                     
     .05     .10    .25  median    .75    .90     .95
    4.00    5.00   7.00   13.00  21.00  30.00   35.00
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   71.00   10.60   0.68   10.38  14.00   1.31    2.24
                                                     
lowest : 1.0 (2), 3.0 (44), 4.0 (34), 5.0 (86), 5.5
highest: 54.0, 55.0 (2), 66.0, 68.0 (2), 72.0

heap(?): remarkable frequency (9.1%) for the mode(s) (= 6)

' 95%-CI (classic)


------------------------------------------------------------------------------ 
7 - total_sulfur_dioxide (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0     140      0  45.67   43.85
          100.0%   0.0%           0.0%          47.49
                                                     
     .05     .10    .25  median    .75    .90     .95
   11.00   14.00  21.00   37.00  61.00  94.00  112.00
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
  283.00   33.16   0.73   26.69  40.00   1.61    4.46
                                                     
lowest : 6.0 (3), 7.0 (4), 8.0 (10), 9.0 (13), 10.0 (24)
highest: 155.0, 160.0, 165.0, 278.0, 289.0

' 95%-CI (classic)


------------------------------------------------------------------------------ 
8 - density (numeric)

    length         n       NAs    unique        0s      mean    meanCI'
     1'279     1'279         0       400         0  0.996736  0.996633
              100.0%      0.0%                0.0%            0.996839
                                                                      
       .05       .10       .25    median       .75       .90       .95
  0.993598  0.994480  0.995600  0.996720  0.997825  0.999108  1.000000
                                                                      
     range        sd     vcoef       mad       IQR      skew      kurt
  0.013620  0.001883  0.001889  0.001661  0.002225  0.127046  0.867452
                                                                      
lowest : 0.99007, 0.99064 (2), 0.99084, 0.9912, 0.9915
highest: 1.0026, 1.00289, 1.00315 (2), 1.0032, 1.00369 (2)

' 95%-CI (classic)

Summary of variables on original scale.

------------------------------------------------------------------------------ 
9 - ph (numeric)

  length       n    NAs  unique    0s  mean  meanCI'
   1'279   1'279      0      86     0  3.31    3.30
          100.0%   0.0%          0.0%          3.32
                                                   
     .05     .10    .25  median   .75   .90     .95
    3.07    3.13   3.21    3.31  3.40  3.51    3.57
                                                   
   range      sd  vcoef     mad   IQR  skew    kurt
    1.15    0.15   0.05    0.15  0.19  0.25    0.88
                                                   
lowest : 2.86, 2.87, 2.88 (2), 2.89 (2), 2.9
highest: 3.72 (3), 3.75, 3.78 (2), 3.9 (2), 4.01 (2)

' 95%-CI (classic)

Summary of variables on original scale.

------------------------------------------------------------------------------ 
10 - sulphates (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      88      0  0.653   0.644
          100.0%   0.0%           0.0%          0.662
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.470   0.490  0.550   0.620  0.730  0.840   0.921
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   1.650   0.163  0.250   0.119  0.180  2.263  10.489
                                                     
lowest : 0.33, 0.37, 0.39 (5), 0.4 (4), 0.42 (4)
highest: 1.59, 1.61, 1.62, 1.95, 1.98

' 95%-CI (classic)

Summary of variables on original scale.

------------------------------------------------------------------------------ 
11 - alcohol (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      61      0  10.42   10.36
          100.0%   0.0%           0.0%          10.48
                                                     
     .05     .10    .25  median    .75    .90     .95
    9.20    9.30   9.50   10.20  11.10  12.00   12.50
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
    6.50    1.06   0.10    1.04   1.60   0.87    0.23
                                                     
lowest : 8.4 (2), 8.5, 8.7, 8.8 (2), 9.0 (22)
highest: 13.4 (2), 13.5, 13.6, 14.0 (6), 14.9

heap(?): remarkable frequency (8.8%) for the mode(s) (= 9.5)

' 95%-CI (classic)

Summary of variables on original scale.

------------------------------------------------------------------------------ 
12 - quality (ordered, factor)

  length      n    NAs unique levels  dupes
   1'279  1'279      0      6      6      y
         100.0%   0.0%                     

   level  freq   perc  cumfreq  cumperc
1      3     8   0.6%        8     0.6%
2      4    42   3.3%       50     3.9%
3      5   545  42.6%      595    46.5%
4      6   510  39.9%    1'105    86.4%
5      7   159  12.4%    1'264    98.8%
6      8    15   1.2%    1'279   100.0%

Summary of variables on original scale.

------------------------------------------------------------------------------ 
13 - quality_gr (ordered, factor)

  length      n    NAs unique levels  dupes
   1'279  1'279      0      3      3      y
         100.0%   0.0%                     

    level   freq   perc  cumfreq  cumperc
1     low     50   3.9%       50     3.9%
2  medium  1'055  82.5%    1'105    86.4%
3    high    174  13.6%    1'279   100.0%

Summary of variables on original scale.

3.3.4 Log-Transformation of Right-Skewed Variables

The amount of skewness in the variables was investigated before (column skewness_original in the table below) and after (skewness_after_log) log-transformation, which was applied to variables with skewness above 0.5.

Python code
skewness = (
  wine_train
  .skew(numeric_only=True)
  .sort_values(ascending=False)
  .rename("skewness_original")
)

skewed_vars = skewness.where(skewness > 0.5).dropna().index.tolist()

skewness_after = (
  wine_train
  .transform_columns(skewed_vars, np.log10)
  .skew(numeric_only=True)
  .rename("skewness_after_log")
)

skew_diff = (skewness_after - skewness).rename("difference")

pd.concat(
  [skewness, skewness_after, skew_diff], axis=1
  ).reset_index(names="variable")
                variable  skewness_original  skewness_after_log  difference
0              chlorides              5.039               1.530      -3.509
1         residual_sugar              4.435               1.803      -2.632
2              sulphates              2.268               0.852      -1.416
3   total_sulfur_dioxide              1.616              -0.033      -1.649
4    free_sulfur_dioxide              1.313              -0.172      -1.485
5          fixed_acidity              1.025               0.426      -0.599
6                alcohol              0.870               0.669      -0.201
7       volatile_acidity              0.710              -0.434      -1.144
8            citric_acid              0.323               0.323       0.000
9                     ph              0.254               0.254       0.000
10               quality              0.226               0.226       0.000
11               density              0.127               0.127       0.000
Guidelines to evaluate degree of skewness (coefficient of asymmetry)

I use these guidelines to interpret the skewness coefficient.

  1. Interpret the direction of skewness:
    • \(< 0\): left-skewed data;
    • \(0\): ideally symmetric data;
    • \(> 0\): right-skewed data.
  2. Interpret the amount of skewness (rule of thumb):
    • \(0\): ideally symmetric;
    • \(-0.5\) to \(+0.5\): almost symmetric;
    • \(-1\) to \(-0.5\) and \(+0.5\) to \(+1\): moderate asymmetry;
    • \(< -1\) and \(> +1\): high asymmetry.
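These rules of thumb can be encoded in a small helper function (an illustrative sketch, not part of the project's code):

```python
def interpret_skewness(coef: float) -> str:
    """Verbal interpretation of a skewness coefficient (rule of thumb)."""
    if coef == 0:
        return "ideally symmetric"
    direction = "left-skewed" if coef < 0 else "right-skewed"
    magnitude = abs(coef)
    if magnitude <= 0.5:
        degree = "almost symmetric"
    elif magnitude <= 1:
        degree = "moderate asymmetry"
    else:
        degree = "high asymmetry"
    return f"{direction}, {degree}"
```

For example, `interpret_skewness(5.039)` (chlorides before transformation) gives "right-skewed, high asymmetry".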

In all cases, the amount of skewness was reduced. Only for alcohol did the coefficient change little. In the case of volatile_acidity, moderately right-skewed data became slightly left-skewed.

Apply the transformation to the dataset:

Python code
# Create a copy of data frame that contains log-transformed variables
new_names_map = {i:"log_" + str(i) for i in skewed_vars}

wine_train_log = (
  wine_train
  .transform_columns(skewed_vars, np.log10)
  .rename(new_names_map, axis=1)
)
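`transform_columns` comes from the pyjanitor package; for readers without it, the same step can be written with plain pandas (a sketch using a toy stand-in for `wine_train`, illustration only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for wine_train (illustration only)
wine_train = pd.DataFrame({"chlorides": [0.1, 0.01], "ph": [3.3, 3.4]})
skewed_vars = ["chlorides"]

# Equivalent of .transform_columns(skewed_vars, np.log10) plus the rename
wine_train_log = wine_train.copy()
wine_train_log[skewed_vars] = np.log10(wine_train_log[skewed_vars])
wine_train_log = wine_train_log.rename(
    columns={var: "log_" + var for var in skewed_vars}
)
```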
Details: inspect data after log-transformation

Column names:

R code
colnames(py$wine_train_log)
 [1] "log_fixed_acidity"        "log_volatile_acidity"    
 [3] "citric_acid"              "log_residual_sugar"      
 [5] "log_chlorides"            "log_free_sulfur_dioxide" 
 [7] "log_total_sulfur_dioxide" "density"                 
 [9] "ph"                       "log_sulphates"           
[11] "log_alcohol"              "quality"                 
[13] "quality_gr"              

Dataset:

R code
head(py$wine_train_log)
R code
glimpse(py$wine_train_log)
Rows: 1,279
Columns: 13
$ log_fixed_acidity        <dbl> 0.8260748, 0.9344985, 1.0374265, 0.8864907, 0…
$ log_volatile_acidity     <dbl> -0.318758763, -0.309803920, -0.275724130, 0.0…
$ citric_acid              <dbl> 0.08, 0.29, 0.49, 0.15, 0.07, 0.42, 0.20, 0.2…
$ log_residual_sugar       <dbl> 0.3222193, 0.3010300, 0.6627578, 0.3222193, 0…
$ log_chlorides            <dbl> -1.1938200, -0.9586073, -0.9281180, -0.991399…
$ log_free_sulfur_dioxide  <dbl> 1.2552725, 1.2787536, 1.0000000, 1.0413927, 1…
$ log_total_sulfur_dioxide <dbl> 1.531479, 2.123852, 1.230449, 1.505150, 1.568…
$ density                  <dbl> 0.99552, 0.99720, 1.00020, 0.99604, 0.99790, …
$ ph                       <dbl> 3.33, 2.93, 3.07, 3.23, 3.43, 3.16, 3.16, 3.2…
$ log_sulphates            <dbl> -0.19382003, 0.29666519, -0.25181197, -0.3187…
$ log_alcohol              <dbl> 0.9867717, 0.9912261, 1.0681859, 1.0000000, 0…
$ quality                  <dbl> 5, 5, 6, 5, 6, 6, 6, 5, 5, 6, 5, 6, 5, 5, 6, …
$ quality_gr               <ord> medium, medium, medium, medium, medium, mediu…
Details: plots and summaries of variables after log-transformation
R code
DescTools::Desc(
  py$wine_train_log |> mutate(quality = ordered(quality)),
  verbose = 3, digits = 3
)
------------------------------------------------------------------------------ 
Describe mutate(py$wine_train_log, quality = ordered(quality)) (data.frame):

data frame: 1279 obs. of  13 variables
        1279 complete cases (100.0%)

  Nr  ColName                   Class            NAs  Levels               
  1   log_fixed_acidity         numeric          .                         
  2   log_volatile_acidity      numeric          .                         
  3   citric_acid               numeric          .                         
  4   log_residual_sugar        numeric          .                         
  5   log_chlorides             numeric          .                         
  6   log_free_sulfur_dioxide   numeric          .                         
  7   log_total_sulfur_dioxide  numeric          .                         
  8   density                   numeric          .                         
  9   ph                        numeric          .                         
  10  log_sulphates             numeric          .                         
  11  log_alcohol               numeric          .                         
  12  quality                   ordered, factor  .    (6): 1-3, 2-4, 3-5,  
                                                      4-6, 5-7, ...        
  13  quality_gr                ordered, factor  .    (3): 1-low, 2-medium,
                                                      3-high               


------------------------------------------------------------------------------ 
1 - log_fixed_acidity (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      94      0  0.910   0.906
          100.0%   0.0%           0.0%          0.915
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.785   0.813  0.851   0.898  0.964  1.025   1.069
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   0.539   0.087  0.095   0.078  0.113  0.425   0.131
                                                     
lowest : 0.663, 0.690, 0.699 (4), 0.708 (3), 0.716 (6)
highest: 1.155, 1.176 (2), 1.190, 1.193 (2), 1.201

' 95%-CI (classic)

------------------------------------------------------------------------------ 
2 - log_volatile_acidity (numeric)

  length       n     NAs  unique      0s    mean  meanCI'
   1'279   1'279       0     141       1  -0.301  -0.310
          100.0%    0.0%            0.1%          -0.292
                                                        
     .05     .10     .25  median     .75     .90     .95
  -0.569  -0.509  -0.398  -0.284  -0.194  -0.119  -0.071
                                                        
   range      sd   vcoef     mad     IQR    skew    kurt
   1.119   0.156  -0.517   0.153   0.204  -0.433   0.364
                                                        
lowest : -0.921 (3), -0.796 (2), -0.745 (8), -0.721, -0.699 (2)
highest: 0.072, 0.074, 0.093, 0.124 (2), 0.199

' 95%-CI (classic)

------------------------------------------------------------------------------ 
3 - citric_acid (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      78    110  0.268   0.257
          100.0%   0.0%           8.6%          0.278
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.000   0.010  0.090   0.250  0.420  0.530   0.600
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   0.790   0.195  0.728   0.252  0.330  0.322  -0.882
                                                     
lowest : 0.000 (110), 0.010 (27), 0.020 (40), 0.030 (21), 0.040 (25)
highest: 0.740 (4), 0.750, 0.760 (2), 0.780, 0.790

heap(?): remarkable frequency (8.6%) for the mode(s) (= 0)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
4 - log_residual_sugar (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      85      0  0.369   0.361
          100.0%   0.0%           0.0%          0.378
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.176   0.230  0.279   0.342  0.415  0.558   0.716
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   1.233   0.158  0.428   0.094  0.136  1.799   4.740
                                                     
lowest : -0.046, 0.079 (7), 0.114 (5), 0.146 (33), 0.176 (22)
highest: 1.111, 1.127, 1.140 (2), 1.143, 1.188 (2)

heap(?): remarkable frequency (9.9%) for the mode(s) (= 0.301029995663981)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
5 - log_chlorides (numeric)

  length       n     NAs  unique      0s    mean  meanCI'
   1'279   1'279       0     141       0  -1.090  -1.098
          100.0%    0.0%            0.0%          -1.083
                                                        
     .05     .10     .25  median     .75     .90     .95
  -1.268  -1.222  -1.155  -1.102  -1.046  -0.974  -0.906
                                                        
   range      sd   vcoef     mad     IQR    skew    kurt
   1.587   0.141  -0.129   0.078   0.109   1.527   8.730
                                                        
lowest : -1.921 (2), -1.469, -1.420 (2), -1.409 (4), -1.387 (4)
highest: -0.384, -0.383 (2), -0.382 (2), -0.375, -0.333

' 95%-CI (classic)

------------------------------------------------------------------------------ 
6 - log_free_sulfur_dioxide (numeric)

  length       n    NAs  unique     0s    mean  meanCI'
   1'279   1'279      0      58      2   1.098   1.082
          100.0%   0.0%           0.2%           1.115
                                                      
     .05     .10    .25  median    .75     .90     .95
   0.602   0.699  0.845   1.114  1.322   1.477   1.544
                                                      
   range      sd  vcoef     mad    IQR    skew    kurt
   1.857   0.301  0.274   0.339  0.477  -0.172  -0.566
                                                      
lowest : 0.000 (2), 0.477 (44), 0.602 (34), 0.699 (86), 0.740
highest: 1.732, 1.740 (2), 1.820, 1.833 (2), 1.857

heap(?): remarkable frequency (9.1%) for the mode(s) (= 0.778151250383644)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
7 - log_total_sulfur_dioxide (numeric)

  length       n    NAs  unique     0s    mean  meanCI'
   1'279   1'279      0     140      0   1.553   1.536
          100.0%   0.0%           0.0%           1.570
                                                      
     .05     .10    .25  median    .75     .90     .95
   1.041   1.146  1.322   1.568  1.785   1.973   2.049
                                                      
   range      sd  vcoef     mad    IQR    skew    kurt
   1.683   0.310  0.199   0.335  0.463  -0.033  -0.693
                                                      
lowest : 0.778 (3), 0.845 (4), 0.903 (10), 0.954 (13), 1.000 (24)
highest: 2.190, 2.204, 2.217, 2.444, 2.461

' 95%-CI (classic)

------------------------------------------------------------------------------ 
8 - density (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0     400      0  0.997   0.997
          100.0%   0.0%           0.0%          0.997
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.994   0.994  0.996   0.997  0.998  0.999   1.000
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   0.014   0.002  0.002   0.002  0.002  0.127   0.867
                                                     
lowest : 0.990, 0.991 (2), 0.991, 0.991, 0.992
highest: 1.003, 1.003, 1.003 (2), 1.003, 1.004 (2)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
9 - ph (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      86      0  3.312   3.304
          100.0%   0.0%           0.0%          3.321
                                                     
     .05     .10    .25  median    .75    .90     .95
   3.070   3.130  3.210   3.310  3.400  3.510   3.570
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   1.150   0.153  0.046   0.148  0.190  0.254   0.881
                                                     
lowest : 2.860, 2.870, 2.880 (2), 2.890 (2), 2.900
highest: 3.720 (3), 3.750, 3.780 (2), 3.900 (2), 4.010 (2)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
10 - log_sulphates (numeric)

  length       n     NAs  unique      0s    mean  meanCI'
   1'279   1'279       0      88       1  -0.196  -0.201
          100.0%    0.0%            0.1%          -0.191
                                                        
     .05     .10     .25  median     .75     .90     .95
  -0.328  -0.310  -0.260  -0.208  -0.137  -0.076  -0.036
                                                        
   range      sd   vcoef     mad     IQR    skew    kurt
   0.778   0.096  -0.487   0.089   0.123   0.850   1.854
                                                        
lowest : -0.481, -0.432, -0.409 (5), -0.398 (4), -0.377 (4)
highest: 0.201, 0.207, 0.210, 0.290, 0.297

' 95%-CI (classic)

------------------------------------------------------------------------------ 
11 - log_alcohol (numeric)

  length       n    NAs  unique     0s   mean  meanCI'
   1'279   1'279      0      61      0  1.016   1.013
          100.0%   0.0%           0.0%          1.018
                                                     
     .05     .10    .25  median    .75    .90     .95
   0.964   0.968  0.978   1.009  1.045  1.079   1.097
                                                     
   range      sd  vcoef     mad    IQR   skew    kurt
   0.249   0.043  0.042   0.046  0.068  0.667  -0.263
                                                     
lowest : 0.924 (2), 0.929, 0.940, 0.944 (2), 0.954 (22)
highest: 1.127 (2), 1.130, 1.134, 1.146 (6), 1.173

heap(?): remarkable frequency (8.8%) for the mode(s) (= 0.977723605288848)

' 95%-CI (classic)

------------------------------------------------------------------------------ 
12 - quality (ordered, factor)

  length      n    NAs unique levels  dupes
   1'279  1'279      0      6      6      y
         100.0%   0.0%                     

   level  freq     perc  cumfreq   cumperc
1      3     8   0.625%        8    0.625%
2      4    42   3.284%       50    3.909%
3      5   545  42.611%      595   46.521%
4      6   510  39.875%    1'105   86.396%
5      7   159  12.432%    1'264   98.827%
6      8    15   1.173%    1'279  100.000%

------------------------------------------------------------------------------ 
13 - quality_gr (ordered, factor)

  length      n    NAs unique levels  dupes
   1'279  1'279      0      3      3      y
         100.0%   0.0%                     

    level   freq     perc  cumfreq   cumperc
1     low     50   3.909%       50    3.909%
2  medium  1'055  82.486%    1'105   86.396%
3    high    174  13.604%    1'279  100.000%

4 Analysis

4.1 Correlation: All Pairs

In this section, we look into the relationships between each pair of numeric variables. We start with graphical analysis (scatter plot matrices) and then proceed to linear correlation analysis.

Scatter plot matrices of non-transformed and log-transformed variables are presented in Fig. 1 and Fig. 2. In the plots we see that some variables are related (e.g., density and fixed acidity). It can also be noticed that after log-transformation the relationship between some variables (e.g., pH and fixed acidity) looks more linear. On the other hand, the plots are small and contain many points, so further investigation is needed to understand the data better.

Python code
fig_sns_1 = sns.pairplot(
  wine_train,
  kind="hist", 
  height=1,
  corner=True,
);

for ax in fig_sns_1.axes.flatten():
  if ax:
    # rotate x axis labels
    ax.set_xlabel(ax.get_xlabel(), rotation = 90);
    # rotate y axis labels
    ax.set_ylabel(ax.get_ylabel(), rotation = 0);
    # set y labels alignment
    ax.yaxis.get_label().set_horizontalalignment('right');
    
fig_sns_1.tight_layout();
plt.show()

Figure 1: Relationships between pairs of variables on the original scale.

Python code
fig_sns_2 = sns.pairplot(
  wine_train_log,
  kind="hist", 
  height=1,
  corner=True,
);

for ax in fig_sns_2.axes.flatten():
  if ax:
    # rotate x axis labels
    ax.set_xlabel(ax.get_xlabel(), rotation = 90);
    # rotate y axis labels
    ax.set_ylabel(ax.get_ylabel(), rotation = 0);
    # set y labels alignment
    ax.yaxis.get_label().set_horizontalalignment('right');
    
fig_sns_2.tight_layout();
plt.show()

Figure 2: Relationships between pairs of variables after log-transformation of right-skewed variables.

Next, the influence of the logarithmic transformation on the linear correlation coefficients is analyzed and summarized in the table below. A negative value of r_abs_difference (the difference in absolute values of the correlation coefficient) means that correlation strength decreased; a positive value shows an increase.

Python code
# Calculate correlation
corr_df_orig = pairwise_correlation(wine_train)
corr_df_log = pairwise_correlation(wine_train_log)

# Investigate the changes in coefficients due to transformation
(
  pd.concat([
    corr_df_orig.r.rename("r_orig_var"),
    corr_df_log.r.rename("r_log_var")
    ], axis=1
    )
    .eval("r_abs_difference = abs(r_log_var) - abs(r_orig_var)")
    .select_columns("r_abs_difference")
    .agg(["min", "mean", "std", "max"])
)
      r_abs_difference
min             -0.114
mean             0.016
std              0.042
max              0.146

The average shift in the correlation coefficient was only slight: some coefficients started showing stronger correlation, others weaker. The cases with the largest positive and negative changes (indicated by the min and max values of r_abs_difference) are analyzed in the details sections. To sum up, it seems that in both cases the coefficients reflected the true correlation better after log-transformation.
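The `pairwise_correlation` helper used above is defined elsewhere in the project; judging by its output columns (X, Y, r, CI95%, p_adj), it likely wraps `pingouin.pairwise_corr`. A minimal SciPy sketch of the core computation (confidence intervals and Holm adjustment omitted):

```python
from itertools import combinations

import pandas as pd
from scipy import stats


def pairwise_correlation(data: pd.DataFrame) -> pd.DataFrame:
    """Pearson r and raw p-value for each pair of numeric columns."""
    num = data.select_dtypes("number")
    rows = []
    for x, y in combinations(num.columns, 2):
        r, p = stats.pearsonr(num[x], num[y])
        rows.append({"X": x, "Y": y, "r": r, "p": p})
    return pd.DataFrame(rows)
```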

Details: case study of chlorides vs. sulphates

After log-transformation, the linear correlation between chlorides and sulphates weakened most evidently (the change in absolute value is -0.11). It seems that the size of the linear correlation coefficient between these two variables is driven by 17 outlying points (chlorides > 0.3; red dashed line in the plot below). The logarithmic transformation reduces this issue, so the true correlation strength is reflected better in this case. On the other hand, the rank correlation coefficient suggests that the correlation is absent or very weak.

R code
ggplot(py$wine_train, aes(x = chlorides, y = sulphates)) +
  geom_point(alpha = 0.1, pch = 19) +
  geom_vline(xintercept = 0.3, color = "darkred", lty = 2, alpha = 0.5) +
  ggtitle("Variables on Original Scale") +

ggplot(py$wine_train, aes(x = log(chlorides), y = log(sulphates))) +
  geom_point(alpha = 0.1, pch = 19) +
  ggtitle("Variables on Log Scale")

R code
py$wine_train |> count(chlorides > 0.3)
R code
c(
  "r_original_variables" = with(
    py$wine_train, cor(chlorides, sulphates, method = "pearson")
  ),
  "r_log_variables" = with(
    py$wine_train,
    cor(log(chlorides), log(sulphates), method = "pearson")
  ),
  "r_without_17_outliers" = with(
    py$wine_train |> filter(chlorides < 0.3),
    cor(chlorides, sulphates, method = "pearson")
  ),
  "r_spearman_all_data" = with(
    py$wine_train, cor(chlorides, sulphates, method = "spearman")
  )
) |> round(2)
 r_original_variables       r_log_variables r_without_17_outliers 
                 0.31                  0.20                  0.05 
  r_spearman_all_data 
                -0.01 
Details: case study of chlorides vs. density

After log-transformation, the linear correlation between chlorides and density strengthened most evidently (the change in absolute value is +0.15). The size of the linear correlation coefficient again seems to be influenced by the group of 17 outlying points (chlorides > 0.3; red dashed line in the plot below). The logarithmic transformation reduced this issue: the correlation of the log-transformed variables is almost the same as that of the non-transformed variables without the 17 points (r = 0.36 vs. r = 0.34).

R code
ggplot(py$wine_train, aes(x = chlorides, y = density)) +
  geom_point(alpha = 0.1, pch = 19) +
  geom_vline(xintercept = 0.3, color = "darkred", lty = 2, alpha = 0.5) +
  ggtitle("Variables on Original Scale") +

ggplot(py$wine_train, aes(x = log(chlorides), y = density)) +
  geom_point(alpha = 0.1, pch = 19) +
  ggtitle("Log-Transformed Chloride")

R code
py$wine_train |> count(chlorides > 0.3)
R code
c(
  "r_original_variables" = with(
    py$wine_train, cor(chlorides, density, method = "pearson")
  ),
  "r_log_variables" = with(
    py$wine_train,
    cor(log(chlorides), density, method = "pearson")
  ),
  "r_without_17_outliers" = with(
    py$wine_train |> filter(chlorides < 0.3),
    cor(chlorides, density, method = "pearson")
  ),
  "r_spearman_all_data" = with(
    py$wine_train, cor(chlorides, density, method = "spearman")
  )
) |> round(2)
 r_original_variables       r_log_variables r_without_17_outliers 
                 0.21                  0.36                  0.34 
  r_spearman_all_data 
                 0.41 

Fig. 3 contains a graphical representation of all pair-wise correlation coefficients. The “details” sections below contain numeric results, while the two tables below the section list correlations with quality score and with alcohol content.

R code
py$wine_train_log |> 
  select_if(is.numeric) |> 
  ggstatsplot::ggcorrmat() + 
  scale_y_discrete(limits = rev)

Figure 3: The results of correlation analysis after right-skewed variables were log-transformed. Numbers in cells indicate the Pearson correlation coefficient. Crossed-out cells show statistically insignificant results. The significance level is 0.05; p-values with Holm correction were used.
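The Holm correction mentioned in the caption can be reproduced with statsmodels; an illustration with hypothetical p-values (not the project's results):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from four significance tests
p_raw = [0.001, 0.02, 0.03, 0.4]

reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method="holm")
# Holm multiplies the i-th smallest p-value by (m - i + 1) and
# enforces monotonicity: 0.001*4, 0.02*3, 0.03*2, 0.4*1
```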

Details: correlation table (variables on original scale)

We can suspect that linear correlation coefficients do not correctly reflect the true strength of relationships between skewed variables.

R code
py$corr_df_orig |>
  rowwise() |> 
  mutate(
    `CI95%_lower` = `CI95%`[[1]],
    `CI95%_upper` = `CI95%`[[2]],
    r = round(r, digits = 2)
  ) |>
  select(-`CI95%`) |>
  ungroup() |>
  rownames_to_column("index")

The results of correlation analysis as correlation matrix. All variables are on the original scale.

Details: correlation table (after log-transformation)
R code
py$corr_df_log |>
  rowwise() |> 
  mutate(
    `CI95%_lower` = `CI95%`[[1]],
    `CI95%_upper` = `CI95%`[[2]],
    r = round(r, digits = 2)
  ) |>
  select(-`CI95%`) |>
  ungroup() |>
  rownames_to_column("index")

The results of correlation analysis after right-skewed variables were log-transformed.

Results. The strongest correlations are detected between log_free_sulfur_dioxide and log_total_sulfur_dioxide (r = 0.79, 95% CI [0.77, 0.81], p_adj < 0.001) and between log_fixed_acidity and ph (r = -0.71, 95% CI [-0.74, -0.69], p_adj < 0.001). In statistical modeling these variables will be used as explanatory variables, so a multicollinearity problem may appear in linear modeling. We have to pay attention to this while interpreting regression coefficients.

4.2 Exploratory PCA

4.2.1 Model Diagnostics

Principal component analysis (PCA) will be used to explore relationships between the target variables (wine quality score, quality group, and alcohol content) and the other variables.

R code
pca_model <- prcomp(py$wine_train_log |> select_if(is.numeric), scale. = TRUE)
R code
fviz_eig(pca_model, addlabels = TRUE) + ylim(NA, 29)

Details: table with eigenvalues and percentage of explained variance
R code
get_eigenvalue(pca_model) |> round(1)
  • To explain more than 80 % of variance, 5 components are required.
  • The “elbow” method suggests 3-4 components.
  • The eigenvalue-above-1 rule suggests 4-5 components.

These numbers might be important for feature selection. For visual inspection, the first 3 principal components, which account for 26.8 %, 20.1 %, and 14.6 % of variance respectively, will be used.
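The same diagnostics can be reproduced from Python with scikit-learn (an illustrative sketch on random toy data, not the wine dataset; `scale. = TRUE` in prcomp corresponds to standardizing the variables first):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
X[:, 1] += X[:, 0]  # introduce correlation so PC1 dominates

X_std = StandardScaler().fit_transform(X)  # like scale. = TRUE in prcomp
pca = PCA().fit(X_std)

explained = pca.explained_variance_ratio_ * 100  # percent per component
cumulative = explained.cumsum()
n_for_80 = int((cumulative < 80).sum()) + 1      # components to pass 80 %
# pca.explained_variance_ holds the eigenvalues (for the "above 1" rule)
```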

4.2.2 PCA Individuals Map

In the PCA “individuals map”, no very clear clusters are visible. It can only be noticed that in the space of PC1 and PC2 (denoted as Dim1 and Dim2), roughly one half of the area covered by points is denser than the other half.

R code
fviz_pca_ind(pca_model, label = FALSE, alpha = 0.2)

R code
fviz_pca_ind(pca_model, axes = c(3, 2), label = FALSE, alpha = 0.2)

After coloring the points, 3 partly overlapping clusters become visible. The direction of the cluster centroids matches the order of the quality groups from lowest to highest. It seems that all first 3 PCs influence cluster separability.

R code
legend_txt = "Quality"

fviz_pca_ind(pca_model, label = FALSE, alpha = 0.3,
  habillage = py$wine_train_log$quality_gr, addEllipses = TRUE,
  palette = "Dark2") +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)

R code
fviz_pca_ind(pca_model, axes = c(3, 2), label = FALSE, alpha = 0.3,
  habillage = py$wine_train_log$quality_gr, addEllipses = TRUE,
  palette = "Dark2") +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)

4.2.3 PCA Correlation Circle Plot

The purpose of a correlation circle plot is to investigate relationships between variables. The angle between arrows shows the strength of the relationship in the analyzed PCA space, and the length of an arrow shows how well the variable is represented in that space.

The variables of interest are quality score and fraction of alcohol content. These variables will be highlighted.

R code
highlight_vars <- recode_factor(
  names(py$wine_train_log |> select_if(is.numeric)),
  "quality" = "quality",
  "log_alcohol" = "log_alcohol",
  .default =  "other"
)
R code
fviz_pca_var(pca_model,
  axes = c(1, 2), repel = TRUE, alpha = 0.7,
  col.var = highlight_vars, palette = c("black", "darkred", "skyblue3")
) +
  lims(x = c(-1.1, 1.1), y = c(-1.1, 1.1)) +
  theme(legend.position = "none")

R code
fviz_pca_var(pca_model,
  axes = c(3, 2), repel = TRUE, alpha = 0.7,
  col.var = highlight_vars, palette = c("black", "darkred", "skyblue3")
) +
  lims(x = c(-1.1, 1.1), y = c(-1.1, 1.1)) +
  theme(legend.position = "none")

4.2.4 PCA Biplot

The purpose of PCA biplots is to investigate the relationship between patterns in points (e.g., clusters) and variables.

R code
fviz_pca_biplot(pca_model,
  axes = c(1, 2), repel = TRUE, alpha.ind = 0.2, alpha.var = 0.85,
  col.var = "black",
  # col.var = highlight_vars,
  label = "var",
  addEllipses = TRUE,
  habillage = py$wine_train_log$quality_gr, 
  palette = "Dark2"
) +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)

R code
fviz_pca_biplot(pca_model,
  axes = c(3, 2), repel = TRUE, alpha.ind = 0.2, alpha.var = 0.85,
  col.var = "black",
  # col.var = highlight_vars,
  label = "var",
  addEllipses = TRUE,
  habillage = py$wine_train_log$quality_gr, 
  palette = "Dark2"
) +
  labs(color = legend_txt, fill = legend_txt, shape = legend_txt)

4.3 Modelling Wine Quality

4.3.1 Correlation

Correlation strength between the wine quality score and the remaining variables:

Python code
corr_df_log.query("X=='quality' | Y == 'quality'")
                           X        Y      r           CI95%   p_adj
65               log_alcohol  quality  0.447     [0.4, 0.49]  <0.001
20      log_volatile_acidity  quality -0.407  [-0.45, -0.36]  <0.001
64             log_sulphates  quality  0.345     [0.3, 0.39]  <0.001
29               citric_acid  quality  0.247     [0.19, 0.3]  <0.001
44             log_chlorides  quality -0.169  [-0.22, -0.12]  <0.001
55  log_total_sulfur_dioxide  quality -0.162  [-0.22, -0.11]  <0.001
59                   density  quality -0.152   [-0.21, -0.1]  <0.001
10         log_fixed_acidity  quality  0.123    [0.07, 0.18]  <0.001
62                        ph  quality -0.063  [-0.12, -0.01]   0.324
50   log_free_sulfur_dioxide  quality -0.041    [-0.1, 0.01]  >0.999
37        log_residual_sugar  quality  0.016   [-0.04, 0.07]  >0.999
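The construction of `corr_df_log` is not shown in this section; its `X`/`Y`/`r`/`CI95%`/`p_adj` layout resembles the output of `pingouin.pairwise_corr`, but that is an assumption. The underlying idea, pairwise Pearson correlations with a multiplicity adjustment, can be sketched with plain numpy/scipy (synthetic data, hypothetical names):

```python
# Sketch: pairwise Pearson correlations with Holm-adjusted p-values
# (illustrative only; not the project's corr_df_log construction).
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
df = {"a": rng.normal(size=100)}
df["b"] = 0.6 * df["a"] + rng.normal(size=100)
df["c"] = rng.normal(size=100)

rows = []
for x, y in combinations(df, 2):
    r, p = stats.pearsonr(df[x], df[y])
    rows.append([x, y, r, p])

# Holm step-down adjustment: multiply the k-th smallest p by (m - k),
# enforce monotonicity, and cap at 1
p_raw = np.array([row[3] for row in rows])
order = np.argsort(p_raw)
m = len(p_raw)
adj = np.maximum.accumulate((m - np.arange(m)) * p_raw[order]).clip(max=1)
p_adj = np.empty(m)
p_adj[order] = adj
```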

4.4 Modelling Alcohol Content

The purpose of this section is to create a linear model that predicts the fraction of alcohol content in wine.

A quick pairwise summary of how the other variables relate to the alcohol content of wine is presented in the details section.

Details: plots and summaries of variables by alcohol content

Let’s investigate the relationships between alcohol content and the other variables.

R code
py$wine_train_log %>%
  mutate(quality = ordered(quality)) %>%
  DescTools::Desc(. ~ log_alcohol, data = ., verbose = 3)
------------------------------------------------------------------------------ 
log_fixed_acidity ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.101
Spearman corr.: -0.077
Kendall corr. : -0.056

------------------------------------------------------------------------------ 
log_volatile_acidity ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.223
Spearman corr.: -0.214
Kendall corr. : -0.144

------------------------------------------------------------------------------ 
citric_acid ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : 0.100
Spearman corr.: 0.083
Kendall corr. : 0.055

------------------------------------------------------------------------------ 
log_residual_sugar ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : 0.069
Spearman corr.: 0.109
Kendall corr. : 0.076

------------------------------------------------------------------------------ 
log_chlorides ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.318
Spearman corr.: -0.306
Kendall corr. : -0.213

------------------------------------------------------------------------------ 
log_free_sulfur_dioxide ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.085
Spearman corr.: -0.080
Kendall corr. : -0.055

------------------------------------------------------------------------------ 
log_total_sulfur_dioxide ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.237
Spearman corr.: -0.253
Kendall corr. : -0.175

------------------------------------------------------------------------------ 
density ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : -0.489
Spearman corr.: -0.461
Kendall corr. : -0.329

------------------------------------------------------------------------------ 
ph ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : 0.229
Spearman corr.: 0.204
Kendall corr. : 0.144

------------------------------------------------------------------------------ 
log_sulphates ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%)


Pearson corr. : 0.152
Spearman corr.: 0.212
Kendall corr. : 0.148

------------------------------------------------------------------------------ 
quality ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 6

                                                            
              3        4        5        6        7        8
mean      1.002    1.016    0.995    1.024    1.056    1.074
median    1.002    1.015    0.987    1.021    1.061    1.068
sd        0.038    0.040    0.031    0.042    0.037    0.044
IQR       0.041    0.066    0.035    0.065    0.051    0.058
n             8       42      545      510      159       15
np       0.625%   3.284%  42.611%  39.875%  12.432%   1.173%
NAs           0        0        0        0        0        0
0s            0        0        0        0        0        0
min       0.924    0.954    0.929    0.924    0.964    0.991
max       1.041    1.117    1.173    1.146    1.146    1.146
Q1        0.990    0.982    0.973    0.991    1.031    1.047
Q3        1.031    1.048    1.009    1.056    1.083    1.106
mad       0.032    0.048    0.020    0.047    0.040    0.048
skew     -0.791    0.199    1.533    0.371   -0.200   -0.331
kurt     -0.510   -0.833    3.497   -0.475   -0.572   -0.942

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 309.38, df = 5, p-value < 2.2e-16





Proportions of quality in the quantiles of log_alcohol:
   
         Q1      Q2      Q3      Q4
  3    0.3%    1.2%    1.0%    0.0%
  4    2.1%    3.9%    3.7%    3.6%
  5   68.2%   54.0%   31.7%   11.9%
  6   28.2%   34.7%   49.3%   49.3%
  7    1.2%    5.6%   13.7%   31.5%
  8    0.0%    0.6%    0.7%    3.6%

------------------------------------------------------------------------------ 
quality_gr ~ log_alcohol (.)

Summary: 
n pairs: 1'279, valid: 1'279 (100.0%), missings: 0 (0.0%), groups: 3

                                 
            low   medium     high
mean      1.014    1.009    1.057
median    1.011    1.000    1.061
sd        0.040    0.040    0.038
IQR       0.058    0.058    0.053
n            50    1'055      174
np       3.909%  82.486%  13.604%
NAs           0        0        0
0s            0        0        0
min       0.924    0.924    0.964
max       1.117    1.173    1.146
Q1        0.983    0.978    1.033
Q3        1.041    1.035    1.086
mad       0.044    0.038    0.040
skew      0.085    0.892   -0.177
kurt     -0.520    0.362   -0.569

Kruskal-Wallis rank sum test:
  Kruskal-Wallis chi-squared = 173.51, df = 2, p-value < 2.2e-16





Proportions of quality_gr in the quantiles of log_alcohol:
        
              Q1      Q2      Q3      Q4
  low       2.4%    5.0%    4.7%    3.6%
  medium   96.5%   88.7%   81.0%   61.3%
  high      1.2%    6.2%   14.3%   35.1%

4.4.1 Correlation

Correlation strength between the alcohol content and the remaining variables:

Python code
corr_df_log.query("X=='log_alcohol' | Y == 'log_alcohol'")
                           X            Y      r           CI95%   p_adj
58                   density  log_alcohol -0.489  [-0.53, -0.45]  <0.001
65               log_alcohol      quality  0.447     [0.4, 0.49]  <0.001
43             log_chlorides  log_alcohol -0.318  [-0.37, -0.27]  <0.001
54  log_total_sulfur_dioxide  log_alcohol -0.237  [-0.29, -0.19]  <0.001
61                        ph  log_alcohol  0.229    [0.18, 0.28]  <0.001
19      log_volatile_acidity  log_alcohol -0.223  [-0.27, -0.17]  <0.001
63             log_sulphates  log_alcohol  0.152     [0.1, 0.21]  <0.001
9          log_fixed_acidity  log_alcohol -0.101  [-0.15, -0.05]   0.007
28               citric_acid  log_alcohol  0.100    [0.05, 0.15]   0.007
49   log_free_sulfur_dioxide  log_alcohol -0.085  [-0.14, -0.03]   0.039
36        log_residual_sugar  log_alcohol  0.069    [0.01, 0.12]   0.191

4.4.2 Sequential Feature Selection: All Features

Let’s do sequential feature selection and investigate how many features might be enough.

Python code
formula_log_full = (
  'log_alcohol ~ '
  'log_total_sulfur_dioxide + log_chlorides + log_sulphates + ' 
  'log_residual_sugar + log_free_sulfur_dioxide  + '
  'log_volatile_acidity + log_fixed_acidity + '
  'density + citric_acid + ph + quality'
)
Python code
formula_full = (
  'alcohol ~ '
  'total_sulfur_dioxide + chlorides + sulphates + ' 
  'residual_sugar + free_sulfur_dioxide  + '
  'volatile_acidity + fixed_acidity + '
  'density + citric_acid + ph + quality'
)
Python code
np.random.seed(251)
sfs_res1 = do_sfs_linear_regression(formula_full, wine_train)
show_sfs_results_lin_reg(sfs_res1, "(No Log-Transformed Variables Included)");
Forward Feature Selection with 5-fold CV 
(No Log-Transformed Variables Included)

Selected variables:
[ n = 9, avg. R² = 0.672 ]

1. sulphates
2. residual_sugar
3. free_sulfur_dioxide
4. volatile_acidity
5. fixed_acidity
6. density
7. citric_acid
8. ph
9. quality

Python code
np.random.seed(251)
sfs_res2 = do_sfs_linear_regression(formula_full, wine_train, False)
show_sfs_results_lin_reg(sfs_res2, "(No Log-Transformed Variables Included)");
Backward Feature Selection with 5-fold CV 
(No Log-Transformed Variables Included)

Selected variables:
[ n = 7, avg. R² = 0.667 ]

1. sulphates
2. residual_sugar
3. free_sulfur_dioxide
4. fixed_acidity
5. density
6. ph
7. quality


Python code
np.random.seed(251)
sfs_res3 = do_sfs_linear_regression(formula_log_full, wine_train_log)
show_sfs_results_lin_reg(sfs_res3, "(Log-Transformed Variables Included)");
Forward Feature Selection with 5-fold CV 
(Log-Transformed Variables Included)

Selected variables:
[ n = 8, avg. R² = 0.690 ]

1. log_sulphates
2. log_residual_sugar
3. log_free_sulfur_dioxide
4. log_fixed_acidity
5. density
6. citric_acid
7. ph
8. quality

Python code
sfs_res4 = do_sfs_linear_regression(formula_log_full, wine_train_log, False)
show_sfs_results_lin_reg(sfs_res4, "(Log-Transformed Variables Included)");
Backward Feature Selection with 5-fold CV 
(Log-Transformed Variables Included)

Selected variables:
[ n = 6, avg. R² = 0.685 ]

1. log_sulphates
2. log_residual_sugar
3. log_fixed_acidity
4. density
5. ph
6. quality

Although the algorithm suggests more, 5–6 variables would be enough, as the additional features give only a slight improvement. Log-transformed features perform slightly better than the features on the original scale.
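The helper `do_sfs_linear_regression` is not shown here; its `get_metric_dict()` API suggests mlxtend's `SequentialFeatureSelector`. The same idea can be sketched with scikit-learn's `SequentialFeatureSelector` on synthetic data (a sketch of the technique, not the project's implementation):

```python
# Sketch: forward feature selection with 5-fold CV and an R² criterion
# (synthetic data; features 0 and 3 carry the signal).
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(251)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=300)

sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2,
    direction="forward", scoring="r2", cv=5,
).fit(X, y)
selected = np.flatnonzero(sfs.get_support())
```

Setting `direction="backward"` gives backward elimination, mirroring the forward/backward runs above.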

Python code
n_features = 6

print("No log-transformed features")
No log-transformed features
Python code
get_sfs_performance(sfs_res2, n_features)
{'n_features': 6, 'mean R²': 0.664, 'median R²': 0.661, 'sd R²': 0.014}
Python code
sfs_res2.get_metric_dict()[n_features]['feature_names']
('sulphates', 'residual_sugar', 'fixed_acidity', 'density', 'ph', 'quality')
Python code
print("With log-transformed features")
With log-transformed features
Python code
get_sfs_performance(sfs_res4, n_features)
{'n_features': 6, 'mean R²': 0.685, 'median R²': 0.667, 'sd R²': 0.031}
Python code
selected_features_1 = sfs_res4.get_metric_dict()[n_features]['feature_names']
selected_features_1
('log_sulphates', 'log_residual_sugar', 'log_fixed_acidity', 'density', 'ph', 'quality')
Python code
formula_selected_1 = "log_alcohol ~ " + " + ".join(selected_features_1)

4.4.3 Sequential Feature Selection: No Quality

The quality score is likely the most expensive feature to obtain and might not be available at prediction time, so let's remove it from the pool of candidate features.

Python code
formula_log_q = (
  'log_alcohol ~ '
  'log_total_sulfur_dioxide + log_chlorides + log_sulphates + ' 
  'log_residual_sugar + log_free_sulfur_dioxide  + '
  'log_volatile_acidity + log_fixed_acidity + '
  'density + citric_acid + ph'
)
Python code
formula_q = (
  'alcohol ~ '
  'total_sulfur_dioxide + chlorides + sulphates + ' 
  'residual_sugar + free_sulfur_dioxide  + '
  'volatile_acidity + fixed_acidity + '
  'density + citric_acid + ph'
)
Python code
np.random.seed(251)
sfs_res1q = do_sfs_linear_regression(formula_q, wine_train)
show_sfs_results_lin_reg(sfs_res1q, "(No Log-Transformed Variables Included)");
Forward Feature Selection with 5-fold CV 
(No Log-Transformed Variables Included)

Selected variables:
[ n = 7, avg. R² = 0.655 ]

1. total_sulfur_dioxide
2. sulphates
3. residual_sugar
4. fixed_acidity
5. density
6. citric_acid
7. ph

Python code
np.random.seed(251)
sfs_res2q = do_sfs_linear_regression(formula_q, wine_train, False)
show_sfs_results_lin_reg(sfs_res2q, "(No Log-Transformed Variables Included)");
Backward Feature Selection with 5-fold CV 
(No Log-Transformed Variables Included)

Selected variables:
[ n = 5, avg. R² = 0.647 ]

1. sulphates
2. residual_sugar
3. fixed_acidity
4. density
5. ph

Python code
np.random.seed(251)
sfs_res3q = do_sfs_linear_regression(formula_log_q, wine_train_log)
show_sfs_results_lin_reg(sfs_res3q, "(Log-Transformed Variables Included)");
Forward Feature Selection with 5-fold CV 
(Log-Transformed Variables Included)

Selected variables:
[ n = 7, avg. R² = 0.681 ]

1. log_total_sulfur_dioxide
2. log_sulphates
3. log_residual_sugar
4. log_fixed_acidity
5. density
6. citric_acid
7. ph

Python code
sfs_res4q = do_sfs_linear_regression(formula_log_q, wine_train_log, False)
show_sfs_results_lin_reg(sfs_res4q, "(Log-Transformed Variables Included)");
Backward Feature Selection with 5-fold CV 
(Log-Transformed Variables Included)

Selected variables:
[ n = 5, avg. R² = 0.673 ]

1. log_sulphates
2. log_residual_sugar
3. log_fixed_acidity
4. density
5. ph

Although the algorithm suggests using more features, 5 features seem to be enough, as the additional features give only a slight improvement. Again, log-transformed features perform slightly better than the features on the original scale.

Python code
n_features_q = 5

print("No log-transformed features")
No log-transformed features
Python code
get_sfs_performance(sfs_res2q, n_features_q)
{'n_features': 5, 'mean R²': 0.647, 'median R²': 0.641, 'sd R²': 0.021}
Python code
sfs_res2q.get_metric_dict()[n_features_q]['feature_names']
('sulphates', 'residual_sugar', 'fixed_acidity', 'density', 'ph')
Python code
print("With log-transformed features")
With log-transformed features
Python code
get_sfs_performance(sfs_res4q, n_features_q)
{'n_features': 5, 'mean R²': 0.673, 'median R²': 0.666, 'sd R²': 0.034}
Python code
selected_features_2 = sfs_res4q.get_metric_dict()[n_features_q]['feature_names']
selected_features_2
('log_sulphates', 'log_residual_sugar', 'log_fixed_acidity', 'density', 'ph')
Python code
formula_selected_2 = "log_alcohol ~ " + " + ".join(selected_features_2)

4.4.4 Compare Null, Full and Selected Models

Create scaled (standardized) data so that the model coefficients are comparable across features and can serve as importance scores.

Python code
wine_train_log_num = wine_train_log.select_dtypes(include="number")

scaler_log = StandardScaler().fit(wine_train_log_num)
df_log_scaled = scaler_log.transform(wine_train_log_num)
df_log_scaled = pd.DataFrame(df_log_scaled, columns=wine_train_log_num.columns)
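Why standardize? On z-scored data each OLS coefficient is expressed in "SDs of y per SD of x", so magnitudes become comparable across features; equivalently, a standardized coefficient equals the raw coefficient times sd(x)/sd(y). A tiny numpy check of this identity (illustrative, synthetic data):

```python
# Sketch: standardized slope == raw slope * sd(x) / sd(y)
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(scale=5.0, size=500)
y = 0.4 * x + rng.normal(size=500)

beta_raw = np.polyfit(x, y, 1)[0]          # slope on the original scale

xs = (x - x.mean()) / x.std()              # z-score both variables
ys = (y - y.mean()) / y.std()
beta_std = np.polyfit(xs, ys, 1)[0]        # slope on the standardized scale
```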

Feature importance in the null model:

Python code
formula_log_null = 'log_alcohol ~ 1'

model_null = smf.ols(formula_log_null, df_log_scaled)
results_null = model_null.fit()
# print(results_null.summary())
# print(f"R² = {results_null.rsquared_adj:.3f}")
# print(" ")
# print(results_null.params.sort_values(key=abs, ascending=False))

Feature importance in the selected model (with quality score):

Python code
model_selected_1 = smf.ols(formula_selected_1, df_log_scaled)
results_selected_1 = model_selected_1.fit()
print(results_selected_1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            log_alcohol   R-squared:                       0.692
Model:                            OLS   Adj. R-squared:                  0.690
Method:                 Least Squares   F-statistic:                     475.4
Date:                Sat, 04 Feb 2023   Prob (F-statistic):          1.23e-320
Time:                        16:02:46   Log-Likelihood:                -1062.6
No. Observations:                1279   AIC:                             2139.
Df Residuals:                    1272   BIC:                             2175.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept          -2.557e-14      0.016  -1.64e-12      1.000      -0.031       0.031
log_sulphates          0.1652      0.017      9.613      0.000       0.132       0.199
log_residual_sugar     0.4324      0.018     24.160      0.000       0.397       0.467
log_fixed_acidity      0.9256      0.031     29.527      0.000       0.864       0.987
density               -1.1012      0.027    -41.437      0.000      -1.153      -1.049
ph                     0.5669      0.023     24.407      0.000       0.521       0.612
quality                0.1376      0.018      7.536      0.000       0.102       0.173
==============================================================================
Omnibus:                      128.690   Durbin-Watson:                   1.946
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              209.744
Skew:                           0.703   Prob(JB):                     2.85e-46
Kurtosis:                       4.399   Cond. No.                         4.15
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Python code
print(f"R² = {results_selected_1.rsquared_adj:.3f}")
R² = 0.690
Python code
print(" ")
Python code
print(results_selected_1.params.sort_values(key=abs, ascending=False))
density              -1.101
log_fixed_acidity     0.926
ph                    0.567
log_residual_sugar    0.432
log_sulphates         0.165
quality               0.138
Intercept            -0.000
dtype: float64

Feature importance in the selected model (without quality score):

Python code
model_selected_2 = smf.ols(formula_selected_2, df_log_scaled)
results_selected_2 = model_selected_2.fit()
print(results_selected_2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            log_alcohol   R-squared:                       0.678
Model:                            OLS   Adj. R-squared:                  0.677
Method:                 Least Squares   F-statistic:                     535.6
Date:                Sat, 04 Feb 2023   Prob (F-statistic):          5.66e-310
Time:                        16:02:47   Log-Likelihood:                -1090.5
No. Observations:                1279   AIC:                             2193.
Df Residuals:                    1273   BIC:                             2224.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept          -2.745e-14      0.016  -1.73e-12      1.000      -0.031       0.031
log_sulphates          0.2136      0.016     13.109      0.000       0.182       0.246
log_residual_sugar     0.4584      0.018     25.551      0.000       0.423       0.494
log_fixed_acidity      1.0022      0.030     33.073      0.000       0.943       1.062
density               -1.1839      0.025    -47.875      0.000      -1.232      -1.135
ph                     0.5924      0.023     25.232      0.000       0.546       0.638
==============================================================================
Omnibus:                      128.278   Durbin-Watson:                   1.961
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              212.311
Skew:                           0.695   Prob(JB):                     7.89e-47
Kurtosis:                       4.432   Cond. No.                         3.79
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Python code
print(f"R² = {results_selected_2.rsquared_adj:.3f}")
R² = 0.677
Python code
print(" ")
Python code
print(results_selected_2.params.sort_values(key=abs, ascending=False))
density              -1.184
log_fixed_acidity     1.002
ph                    0.592
log_residual_sugar    0.458
log_sulphates         0.214
Intercept            -0.000
dtype: float64

Feature importance in the full model:

Python code
model_full = smf.ols(formula_log_full, df_log_scaled)
results_full = model_full.fit()
print(results_full.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            log_alcohol   R-squared:                       0.704
Model:                            OLS   Adj. R-squared:                  0.701
Method:                 Least Squares   F-statistic:                     273.9
Date:                Sat, 04 Feb 2023   Prob (F-statistic):               0.00
Time:                        16:02:48   Log-Likelihood:                -1036.3
No. Observations:                1279   AIC:                             2097.
Df Residuals:                    1267   BIC:                             2158.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
============================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                -2.481e-14      0.015  -1.62e-12      1.000      -0.030       0.030
log_total_sulfur_dioxide    -0.0707      0.028     -2.534      0.011      -0.125      -0.016
log_chlorides               -0.0495      0.018     -2.750      0.006      -0.085      -0.014
log_sulphates                0.1755      0.018      9.789      0.000       0.140       0.211
log_residual_sugar           0.4212      0.018     23.504      0.000       0.386       0.456
log_free_sulfur_dioxide     -0.0049      0.027     -0.183      0.855      -0.057       0.047
log_volatile_acidity         0.1002      0.021      4.726      0.000       0.059       0.142
log_fixed_acidity            0.8079      0.037     21.860      0.000       0.735       0.880
density                     -1.0620      0.029    -37.148      0.000      -1.118      -1.006
citric_acid                  0.1477      0.026      5.619      0.000       0.096       0.199
ph                           0.5402      0.025     22.015      0.000       0.492       0.588
quality                      0.1371      0.019      7.332      0.000       0.100       0.174
==============================================================================
Omnibus:                      121.513   Durbin-Watson:                   1.964
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              192.747
Skew:                           0.681   Prob(JB):                     1.40e-42
Kurtosis:                       4.326   Cond. No.                         5.84
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Python code
print(f"R² = {results_full.rsquared_adj:.3f}")
R² = 0.701
Python code
print(" ")
Python code
print(results_full.params.sort_values(key=abs, ascending=False))
density                    -1.062
log_fixed_acidity           0.808
ph                          0.540
log_residual_sugar          0.421
log_sulphates               0.176
citric_acid                 0.148
quality                     0.137
log_volatile_acidity        0.100
log_total_sulfur_dioxide   -0.071
log_chlorides              -0.050
log_free_sulfur_dioxide    -0.005
Intercept                  -0.000
dtype: float64

4.4.5 Model Diagnostics

Python code
lm = results_selected_2
Python code
(results_selected_2.get_influence().cooks_distance[0] > 1).any()
False
Python code
plt.clf()
fig = sm.graphics.plot_partregress_grid(lm);
Python code
fig.tight_layout();
plt.show()

Python code
plt.clf()
fig = sm.graphics.plot_ccpr_grid(lm)
fig.tight_layout(pad=1.0)
plt.show()

Python code
name = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
test = sms.het_breuschpagan(lm.resid, lm.model.exog)
[*zip(name, test)]
[('Lagrange multiplier statistic', 90.84904277133138), ('p-value', 4.455957158445557e-18), ('f-value', 19.467363257890643), ('f p-value', 1.0150662334875106e-18)]

5 Looker Studio Dashboard

A part of this project was to create a Looker Studio dashboard for exploring some properties of red wine. A screenshot of the dashboard and a link to it are provided below.

Link to the dashboard.

6 Limitations and Suggestions on Improvement

  • The code would be easier to maintain if it were written in a single language (either R or Python). Bilingual code requires an expert who knows both languages.

References

Cortez, Paulo, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4): 547–53. https://doi.org/10.1016/j.dss.2009.05.016.
“Red Wine Quality — Kaggle.com.” https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009/versions/2.
“Wine Quality Datasets — www3.dsi.uminho.pt.” http://www3.dsi.uminho.pt/pcortez/wine/.